Telegram Group & Telegram Channel
🧩 The Ultimate LLM Benchmark Collection

Подборка живых бенчмарков, которые стоит открывать при каждом релизе новой модели — и тех, на которые можно больше не тратить время.

🌐 Общие (multi‑skill) лидерборды
SimpleBench — https://simple-bench.com/index.html

SOLO‑Bench — https://github.com/jd-3d/SOLOBench

AidanBench — https://aidanbench.com

SEAL by Scale (MultiChallenge) — https://scale.com/leaderboard

LMArena (Style Control) — https://beta.lmarena.ai/leaderboard

LiveBench — https://livebench.ai

ARC‑AGI — https://arcprize.org/leaderboard

Thematic Generalization (Lech Mazur) — https://github.com/lechmazur/generalization

дополнительные бенчмарки Lech Mazur:

Elimination Game — https://github.com/lechmazur/elimination_game

Confabulations — https://github.com/lechmazur/confabulations

EQBench (Longform Writing) — https://eqbench.com

Fiction‑Live Bench — https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87

MC‑Bench (сортировать по win‑rate) — https://mcbench.ai/leaderboard

TrackingAI – IQ Bench — https://trackingai.org/home

Dubesor LLM Board — https://dubesor.de/benchtable.html

Balrog‑AI — https://balrogai.com

Misguided Attention — https://github.com/cpldcpu/MisguidedAttention

Snake‑Bench — https://snakebench.com

SmolAgents LLM (из‑за GAIA & SimpleQA) — https://huggingface.co/spaces/smolagents/smolagents-leaderboard

Context‑Arena (MRCR, Graphwalks) — https://contextarena.ai

OpenCompass — https://rank.opencompass.org.cn/home

HHEM (Hallucination) — https://huggingface.co/spaces/vectara/leaderboard

🛠️ Coding / Math / Agentic
Aider‑Polyglot‑Coding — https://aider.chat/docs/leaderboards/

BigCodeBench — https://bigcode-bench.github.io

WebDev‑Arena — https://web.lmarena.ai/leaderboard

WeirdML — https://htihle.github.io/weirdml.html

Symflower Coding Eval v1.0 — https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/

PHYBench — https://phybench-official.github.io/phybench-demo/

MathArena — https://matharena.ai

Galileo Agent Leaderboard — https://huggingface.co/spaces/galileo-ai/agent-leaderboard

XLANG Agent Arena — https://arena.xlang.ai/leaderboard

🚀 Для отслеживания AI take‑off
METR Long‑Task Benchmarks (вкл. RE Bench) — https://metr.org

PaperBench — https://openai.com/index/paperbench/

SWE‑Lancer — https://openai.com/index/swe-lancer/

MLE‑Bench — https://github.com/openai/mle-bench

SWE‑Bench — https://swebench.com

🏆 Обязательный «классический» набор
GPQA‑Diamond — https://github.com/idavidrein/gpqa

SimpleQA — https://openai.com/index/introducing-simpleqa/

Tau‑Bench — https://github.com/sierra-research/tau-bench

SciCode — https://github.com/scicode-bench/SciCode

MMMU — https://mmmu-benchmark.github.io/#leaderboard

Humanities Last Exam (HLE) — https://github.com/centerforaisafety/hle

🔍 Классические бенчмарков

Simple‑Evals — https://github.com/openai/simple-evals

Vellum AI Leaderboard — https://vellum.ai/llm-leaderboard

Artificial Analysis — https://artificialanalysis.ai

⚠️ «Перегретые» метрики, на которые можно не смотреть
MMLU, HumanEval, BBH, DROP, MGSM

Большинство чисто‑математических датасетов: GSM8K, MATH, AIME, ...

Модели близки к верхним значениям на них и в них нет особого смысла.



tg-me.com/data_analysis_ml/3536
Create:
Last Update:

🧩 The Ultimate LLM Benchmark Collection

Подборка живых бенчмарков, которые стоит открывать при каждом релизе новой модели — и тех, на которые можно больше не тратить время.

🌐 Общие (multi‑skill) лидерборды
SimpleBench — https://simple-bench.com/index.html

SOLO‑Bench — https://github.com/jd-3d/SOLOBench

AidanBench — https://aidanbench.com

SEAL by Scale (MultiChallenge) — https://scale.com/leaderboard

LMArena (Style Control) — https://beta.lmarena.ai/leaderboard

LiveBench — https://livebench.ai

ARC‑AGI — https://arcprize.org/leaderboard

Thematic Generalization (Lech Mazur) — https://github.com/lechmazur/generalization

дополнительные бенчмарки Lech Mazur:

Elimination Game — https://github.com/lechmazur/elimination_game

Confabulations — https://github.com/lechmazur/confabulations

EQBench (Longform Writing) — https://eqbench.com

Fiction‑Live Bench — https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87

MC‑Bench (сортировать по win‑rate) — https://mcbench.ai/leaderboard

TrackingAI – IQ Bench — https://trackingai.org/home

Dubesor LLM Board — https://dubesor.de/benchtable.html

Balrog‑AI — https://balrogai.com

Misguided Attention — https://github.com/cpldcpu/MisguidedAttention

Snake‑Bench — https://snakebench.com

SmolAgents LLM (из‑за GAIA & SimpleQA) — https://huggingface.co/spaces/smolagents/smolagents-leaderboard

Context‑Arena (MRCR, Graphwalks) — https://contextarena.ai

OpenCompass — https://rank.opencompass.org.cn/home

HHEM (Hallucination) — https://huggingface.co/spaces/vectara/leaderboard

🛠️ Coding / Math / Agentic
Aider‑Polyglot‑Coding — https://aider.chat/docs/leaderboards/

BigCodeBench — https://bigcode-bench.github.io

WebDev‑Arena — https://web.lmarena.ai/leaderboard

WeirdML — https://htihle.github.io/weirdml.html

Symflower Coding Eval v1.0 — https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/

PHYBench — https://phybench-official.github.io/phybench-demo/

MathArena — https://matharena.ai

Galileo Agent Leaderboard — https://huggingface.co/spaces/galileo-ai/agent-leaderboard

XLANG Agent Arena — https://arena.xlang.ai/leaderboard

🚀 Для отслеживания AI take‑off
METR Long‑Task Benchmarks (вкл. RE Bench) — https://metr.org

PaperBench — https://openai.com/index/paperbench/

SWE‑Lancer — https://openai.com/index/swe-lancer/

MLE‑Bench — https://github.com/openai/mle-bench

SWE‑Bench — https://swebench.com

🏆 Обязательный «классический» набор
GPQA‑Diamond — https://github.com/idavidrein/gpqa

SimpleQA — https://openai.com/index/introducing-simpleqa/

Tau‑Bench — https://github.com/sierra-research/tau-bench

SciCode — https://github.com/scicode-bench/SciCode

MMMU — https://mmmu-benchmark.github.io/#leaderboard

Humanities Last Exam (HLE) — https://github.com/centerforaisafety/hle

🔍 Классические бенчмарков

Simple‑Evals — https://github.com/openai/simple-evals

Vellum AI Leaderboard — https://vellum.ai/llm-leaderboard

Artificial Analysis — https://artificialanalysis.ai

⚠️ «Перегретые» метрики, на которые можно не смотреть
MMLU, HumanEval, BBH, DROP, MGSM

Большинство чисто‑математических датасетов: GSM8K, MATH, AIME, ...

Модели близки к верхним значениям на них и в них нет особого смысла.

BY Анализ данных (Data analysis)


Warning: Undefined variable $i in /var/www/tg-me/post.php on line 283

Share with your friend now:
tg-me.com/data_analysis_ml/3536

View MORE
Open in Telegram


Анализ данных Data analysis Telegram | DID YOU KNOW?

Date: |

Should You Buy Bitcoin?

In general, many financial experts support their clients’ desire to buy cryptocurrency, but they don’t recommend it unless clients express interest. “The biggest concern for us is if someone wants to invest in crypto and the investment they choose doesn’t do well, and then all of a sudden they can’t send their kids to college,” says Ian Harvey, a certified financial planner (CFP) in New York City. “Then it wasn’t worth the risk.” The speculative nature of cryptocurrency leads some planners to recommend it for clients’ “side” investments. “Some call it a Vegas account,” says Scott Hammel, a CFP in Dallas. “Let’s keep this away from our real long-term perspective, make sure it doesn’t become too large a portion of your portfolio.” In a very real sense, Bitcoin is like a single stock, and advisors wouldn’t recommend putting a sizable part of your portfolio into any one company. At most, planners suggest putting no more than 1% to 10% into Bitcoin if you’re passionate about it. “If it was one stock, you would never allocate any significant portion of your portfolio to it,” Hammel says.

Dump Scam in Leaked Telegram Chat

A leaked Telegram discussion by 50 so-called crypto influencers has exposed the extraordinary steps they take in order to profit on the back off unsuspecting defi investors. According to a leaked screenshot of the chat, an elaborate plan to defraud defi investors using the worthless “$Few” tokens had been hatched. $Few tokens would be airdropped to some of the influencers who in turn promoted these to unsuspecting followers on Twitter.

Анализ данных Data analysis from us


Telegram Анализ данных (Data analysis)
FROM USA